Web Scraping with R & rvest

Dr. Matthew Hendrickson

July 9, 2020

About Me

  • Dr. Matthew Hendrickson
  • Social Scientist by Training
    • Psychology & Music %>%
    • Cognitive & Social Psychology %>%
    • Law & Policy
  • Professional Experience (13+ years)
    • Higher Education Analyst
    • Independent Consultant
    • Research projects, data analysis, policy development, strategy, analytics pipeline solutions

Topics

  1. A Little About Web Scraping
  2. Robots!
  3. HTML & CSS
  4. The Setup
  5. Scraping the Data
  6. Assembling the Data
  7. References & Resources

A Little About Web Scraping

“Web scraping is the process of automatically mining data or collecting information from the World Wide Web” – Wikipedia


Web scraping is a flexible method to extract numerical or textual data from the internet

Use Cases

There are many uses for web scraping, including:

  1. Price monitoring
  2. Time series tracking & analysis
  3. Sentiment analysis
  4. Brand monitoring
  5. Market analysis
  6. Lead generation

Robots!

  • No, not those robots!
  • Always ensure PRIOR to scraping that you have scraping rights!
  • This is critical as you can be blocked or even face legal action!

Robots.txt

You can easily check a site’s robots.txt with the robotstxt package

library(robotstxt)
paths_allowed(paths = c("https://netflix.com/"))
#> [1] FALSE

Netflix’s robots.txt does not allow bots to scrape this path

HTML & CSS


Hyper Text Markup Language

“HTML is the standard markup language for creating Web pages”



Cascading Style Sheets

“CSS describes how HTML elements are to be displayed on screen, paper, or in other media”

– W3Schools

HTML Structure

Image credit: Professor Shawn Santo

HTML Tags

HTML is structured with “tags” indicating portions of a page

Tags can be targeted by their name or by their place in the page structure

Tags can be nested

A few important tags (of many) for scraping:

  • <h1> heading (levels <h1> to <h6>) </h1>
  • <p> paragraph elements </p>
  • <ul> unordered bulleted list </ul>
  • <ol> ordered list </ol>
  • <li> individual list item </li>
  • <div> division </div>
  • <table> table </table>
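A minimal sketch of how these tags map to selections in rvest. The HTML fragment is made up for illustration and is inlined so the example runs offline.

```r
library(rvest)

# Hypothetical page fragment (not from any real site)
page <- read_html('<html><body>
  <h1>R Books</h1>
  <p>A short reading list.</p>
  <ul>
    <li>First book</li>
    <li>Second book</li>
  </ul>
</body></html>')

# Select by tag name
heading <- page %>% html_nodes("h1") %>% html_text()

# Nested tags are reached with a descendant selector: <li> inside <ul>
items <- page %>% html_nodes("ul li") %>% html_text()

heading
items
```

Each call returns a character vector with one element per matching node.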

A Little Help with CSS

Extracting parts of a website can be daunting if you are unfamiliar with CSS

SelectorGadget is helpful (Chrome extension; also available as a bookmarklet)

Inspecting the page elements is also helpful

Scraping Methods

CSS selectors - syntax is easier & aligns with HTML tags

XPath - useful when the node isn’t uniquely identified with CSS
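A small sketch of the two methods side by side. The fragment and the class name `info` are hypothetical: two nodes share a class, so a CSS class selector alone cannot tell them apart, while XPath can select by text content.

```r
library(rvest)

# Hypothetical fragment: label and value share the same class
page <- read_html('<html><body>
  <div><span class="info">Price</span><span class="info">40.10</span></div>
</body></html>')

# CSS: short, mirrors the tag and class names, but matches both spans
css_all <- page %>% html_nodes(css = "span.info") %>% html_text()

# XPath: pick the <span> that directly follows the one containing "Price"
xp_price <- page %>%
  html_nodes(xpath = "//span[text()='Price']/following-sibling::span[1]") %>%
  html_text()

css_all
xp_price
```

Both arguments belong to the same `html_nodes()` function, so switching methods only changes the selector string.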

The Setup

library(robotstxt)
library(rvest)
library(tidyverse)

That’s it!

Determine a website to scrape

Seems appropriate to pull R book data from Amazon

paths_allowed(paths = c("https://amazon.com/"))
#> [1] TRUE


We are good to scrape!
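The check above covers the site root. A robots.txt file can set different rules per path, so it may be worth checking the path you actually plan to request. The sketch below uses a made-up robots.txt passed in as text, so it runs offline; the rules shown are an assumption for illustration, not Amazon's.

```r
library(robotstxt)

# Hypothetical robots.txt content, supplied directly instead of downloaded
rules <- robotstxt(text = "User-agent: *\nDisallow: /search/\n")

# The root is allowed, a path under /search/ is not
root_ok   <- rules$check(paths = "/", bot = "*")
search_ok <- rules$check(paths = "/search/results", bot = "*")

# Against a live site, the same idea is to pass the path you intend to
# request along with the domain, for example:
# paths_allowed(paths = "/s", domain = "amazon.com")
```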

Specify the URL

amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")

Data as of 2020-07-07

Titles

Scraping book titles

amazon %>% 
  html_nodes(".s-line-clamp-2") %>% 
  html_text() -> titles
head(titles)
#> [1] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n            \n        \n        \n    \n\n\n    \n"          
#> [2] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                The Book of R: A First Course in Programming and Statistics\n            \n        \n        \n    \n\n\n    \n"                     
#> [3] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Discovering Statistics Using R\n            \n        \n        \n    \n\n\n    \n"                                                  
#> [4] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R Graphics Cookbook: Practical Recipes for Visualizing Data\n            \n        \n        \n    \n\n\n    \n"                     
#> [5] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Advanced R, Second Edition (Chapman & Hall/CRC The R Series)\n            \n        \n        \n    \n\n\n    \n"                    
#> [6] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)\n            \n        \n        \n    \n\n\n    \n"

Removing \n & white space from the titles

titles <- str_trim(titles) # Removes leading & trailing space
head(titles)
#> [1] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"          
#> [2] "The Book of R: A First Course in Programming and Statistics"                     
#> [3] "Discovering Statistics Using R"                                                  
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"                     
#> [5] "Advanced R, Second Edition (Chapman & Hall/CRC The R Series)"                    
#> [6] "Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)"

Formats

Scraping the book format

amazon %>% 
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>% 
  html_text() -> format
head(format)
#> [1] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [2] "\n    \n        \n        \n            Kindle\n        \n    \n"   
#> [3] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [4] "\n    \n        \n        \n            eTextbook\n        \n    \n"
#> [5] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [6] "\n    \n        \n        \n            Kindle\n        \n    \n"

Clean up book format values

format <- str_trim(format)
head(format)
#> [1] "Paperback" "Kindle"    "Paperback" "eTextbook" "Paperback" "Kindle"

Price

Scraping the book price

amazon %>% 
  html_nodes(".a-price-whole") %>% 
  html_text() -> price_whole
head(price_whole)
#> [1] "40." "24." "33." "29." "34." "61."

Scraping (the rest of) the book price

amazon %>% 
  html_nodes(".a-price-fraction") %>% 
  html_text() -> price_fraction
head(price_fraction)
#> [1] "10" "99" "04" "99" "37" "60"

Combine price portions

price <- paste(price_whole, price_fraction, sep = "")
head(price)
#> [1] "40.10" "24.99" "33.04" "29.99" "34.37" "61.60"

Make it numeric

price <- as.numeric(price)
head(price)
#> [1] 40.10 24.99 33.04 29.99 34.37 61.60
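One caveat, sketched with made-up values: `as.numeric()` returns NA for a price that contains a thousands separator. `readr::parse_number()` is one way to handle that case. The price pieces below are hypothetical and only mimic the shape of the scraped ones.

```r
library(readr)

# Made-up price pieces; the second has a thousands separator
price_whole    <- c("40.", "1,299.")
price_fraction <- c("10", "99")

price_chr <- paste0(price_whole, price_fraction)

# as.numeric() cannot read "1,299.99" and yields NA (with a warning)
naive <- suppressWarnings(as.numeric(price_chr))

# parse_number() drops the grouping mark before converting
price <- parse_number(price_chr)

naive
price
```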

Rating

Scraping the book rating

amazon %>% 
  html_nodes("i.a-icon.a-icon-star-small.aok-align-bottom") %>% 
  html_text() -> rating
head(rating)
#> [1] "4.7 out of 5 stars" "4.3 out of 5 stars" "4.5 out of 5 stars"
#> [4] "4.7 out of 5 stars" "4.8 out of 5 stars" "4.4 out of 5 stars"

Trim into a usable metric

rating <- substr(rating, 1, 3)
head(rating)
#> [1] "4.7" "4.3" "4.5" "4.7" "4.8" "4.4"

Make it numeric

rating <- as.numeric(rating)
head(rating)
#> [1] 4.7 4.3 4.5 4.7 4.8 4.4
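`substr(rating, 1, 3)` relies on every rating being exactly three characters wide. A hedged alternative is to extract the leading number with a regular expression. The second value below is made up to show a case where the fixed-width cut would fail.

```r
library(stringr)

# Made-up rating strings; the second has a one-character rating
rating_txt <- c("4.7 out of 5 stars", "5 out of 5 stars")

# substr() with a fixed width keeps stray characters: "5 o"
fixed_width <- substr(rating_txt, 1, 3)

# The regex takes the leading number whatever its width
rating <- as.numeric(str_extract(rating_txt, "^\\d+(\\.\\d+)?"))

fixed_width
rating
```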

Rating Counts

Scraping the book rating count

amazon %>% 
  html_nodes("div.a-row.a-size-small") %>% 
  html_text() -> rate_n
head(rate_n)
#> [1] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427\n            \n        \n        \n    \n\n\n\n"
#> [2] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76\n            \n        \n        \n    \n\n\n\n" 
#> [3] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255\n            \n        \n        \n    \n\n\n\n"
#> [4] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n" 
#> [5] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.8 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                31\n            \n        \n        \n    \n\n\n\n" 
#> [6] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.4 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n"

Trim the rating count

rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                427"
#> [2] "4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                76" 
#> [3] "4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255"
#> [4] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14" 
#> [5] "4.8 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                31" 
#> [6] "4.4 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14"

Rating count - substring

rate_n <- str_sub(rate_n, -5)
head(rate_n)
#> [1] "  427" "   76" "  255" "   14" "   31" "   14"

Trim the rating count (again)

rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "427" "76"  "255" "14"  "31"  "14"

Set as numeric

rate_n <- as.numeric(rate_n)
head(rate_n)
#> [1] 427  76 255  14  31  14
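The trim, substring, trim sequence above works for counts of up to five characters. A sketch of a variant that pulls the trailing number directly, and also copes with a comma in larger counts, is shown below. The input strings are made up to resemble the scraped text.

```r
library(stringr)

# Made-up strings: a rating, filler whitespace, then the count at the end
rate_txt <- c("\n  4.7 out of 5 stars\n \n     427\n\n",
              "\n  4.5 out of 5 stars\n \n     1,204\n\n")

trimmed   <- str_trim(rate_txt)                  # drop outer whitespace
count_chr <- str_extract(trimmed, "[\\d,]+$")    # digits and commas at the end
rate_n    <- as.numeric(str_remove_all(count_chr, ","))  # "1,204" -> 1204

rate_n
```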

Publication Date

Scraping the book publication date

amazon %>% 
  html_nodes("span.a-size-base.a-color-secondary.a-text-normal") %>% 
  html_text() -> pub_dt
head(pub_dt)
#> [1] "Jan 10, 2017" "Jul 16, 2016" "Apr 5, 2012"  "Nov 30, 2018" "May 30, 2019"
#> [6] "Dec 5, 2018"

Convert to a date

pub_dt <- as.Date(pub_dt, "%b %d, %Y")
head(pub_dt)
#> [1] "2017-01-10" "2016-07-16" "2012-04-05" "2018-11-30" "2019-05-30"
#> [6] "2018-12-05"
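Note that `%b` matches month abbreviations in the current locale, so the same strings can parse to NA on a machine with a non-English locale. A minimal sketch of a guard, using two of the dates shown above:

```r
pub_txt <- c("Jan 10, 2017", "Apr 5, 2012")

pub_dt <- as.Date(pub_txt, format = "%b %d, %Y")

# Flag failed parses instead of carrying NA dates forward silently
if (anyNA(pub_dt)) {
  warning("Some dates failed to parse; check Sys.getlocale('LC_TIME')")
}
```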

We Have the Pieces

Let’s assemble the file!

  1. Titles
  2. Formats
  3. Prices
  4. Ratings
  5. Rating Counts
  6. Publication Date

Check the scrapes

length(titles)
#> [1] 16
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 14
length(rate_n)
#> [1] 14
length(pub_dt)
#> [1] 16

Wait! What?!?

What Happened?

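Sometimes you get an uneven number of records in the scrape: titles and publication dates appear once per book, formats and prices once per format, and ratings only for books that have them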

We can fix this!


…manually…

Fixing the Scrapes

Fixing Titles

Titles scraped accurately, but each book has multiple formats

titles %>% 
  rep(each = 2) -> titles
length(titles)
#> [1] 32

Fixing Titles

Some books have 3 formats

titles %>% 
  append(values = titles[15], after = 15) %>% 
  append(values = titles[11], after = 11) %>% 
  append(values = titles[9], after = 9) %>% 
  append(values = titles[5], after = 5) -> titles
length(titles)
#> [1] 36
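The two steps above (repeat everything twice, then append the third copies) can also be sketched as a single `rep()` call with a per-book count. The titles and the `n_formats` vector below are toy stand-ins; with the real data, `n_formats` would have to be filled in by inspecting the page.

```r
# Toy stand-ins for the scraped titles and the number of formats per book
titles    <- c("Book A", "Book B", "Book C")
n_formats <- c(2, 3, 2)   # hypothetical counts, one per title

# times = repeats each title by its own count, keeping the original order
titles_long <- rep(titles, times = n_formats)

titles_long
```

The same `n_formats` vector could be reused for ratings, rating counts, and publication dates, which keeps all four fields in step.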

Fixing Formats

Nothing needed here!

length(format)
#> [1] 36

Fixing Prices

Or here!

length(price)
#> [1] 36

Fixing Ratings

Books missing ratings

rating %>% 
  append(values = NA, after = 7) %>% 
  append(values = NA, after = 11) -> rating
length(rating)
#> [1] 16

Fixing Ratings

Multiple formats - repeat ratings

rating %>% 
  rep(each = 2) -> rating
length(rating)
#> [1] 32

Fixing Ratings

Books with 3 formats

rating %>% 
  append(values = rating[15], after = 15) %>% 
  append(values = rating[11], after = 11) %>% 
  append(values = rating[9], after = 9) %>% 
  append(values = rating[5], after = 5) -> rating
length(rating)
#> [1] 36

Fixing Rating Counts

Books missing ratings & rating counts

rate_n %>% 
  append(values = NA, after = 7) %>% 
  append(values = NA, after = 11) -> rate_n
length(rate_n)
#> [1] 16

Fixing Rating Counts

Multiple formats - repeat rating counts

rate_n %>% 
  rep(each = 2) -> rate_n
length(rate_n)
#> [1] 32

Fixing Rating Counts

Books with 3 formats

rate_n %>% 
  append(values = rate_n[15], after = 15) %>% 
  append(values = rate_n[11], after = 11) %>% 
  append(values = rate_n[9], after = 9) %>% 
  append(values = rate_n[5], after = 5) -> rate_n
length(rate_n)
#> [1] 36

Fixing Publication Date

Multiple formats - repeat publication dates

pub_dt %>% 
  rep(each = 2) -> pub_dt
length(pub_dt)
#> [1] 32

Fixing Publication Date

Books with 3 formats

pub_dt %>% 
  append(values = pub_dt[15], after = 15) %>% 
  append(values = pub_dt[11], after = 11) %>% 
  append(values = pub_dt[9], after = 9) %>% 
  append(values = pub_dt[5], after = 5) -> pub_dt
length(pub_dt)
#> [1] 36

One Last Check!

length(titles)
#> [1] 36
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 36
length(rate_n)
#> [1] 36
length(pub_dt)
#> [1] 36
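The six `length()` calls can be collapsed into one check. A sketch with toy vectors standing in for the scraped fields:

```r
# Toy vectors standing in for the scraped fields
fields <- list(titles = c("a", "b"),
               format = c("x", "y"),
               price  = c(1, 2))

n <- lengths(fields)   # named vector of lengths, one entry per field
n

# Stop early if any field is out of step with the others
stopifnot(length(unique(n)) == 1)
```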

(Finally) Assemble the Data

r_books <- tibble(title            = titles,
                  text_format      = format,
                  price            = price,
                  rating           = rating,
                  num_ratings      = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>          
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10      
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10      
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16      
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16      
#> 5 Discovering Statistics ~ Paperback    34.4    4.5         255 2012-04-05      
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05
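The manual fixes depend on the positions in this particular snapshot of the page. An alternative sketch is to scrape one container per result and then look up each field inside it, so missing values become NA and the fields stay aligned. The HTML and class names below are made up for illustration; they are not Amazon's markup.

```r
library(rvest)

# Hypothetical results page: the second book has no rating
page <- read_html('<html><body>
  <div class="result"><h2>Book A</h2><span class="rating">4.7</span></div>
  <div class="result"><h2>Book B</h2></div>
  <div class="result"><h2>Book C</h2><span class="rating">4.4</span></div>
</body></html>')

results <- page %>% html_nodes("div.result")   # one node per book

# html_node() (singular) returns exactly one match per result,
# and a missing node becomes NA after html_text()
title  <- results %>% html_node("h2") %>% html_text()
rating <- results %>% html_node("span.rating") %>% html_text() %>% as.numeric()

books <- data.frame(title, rating)
books
```

Because every field is looked up within the same set of result nodes, each vector has one entry per book and no padding is needed.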

References & Resources

Thank you


  • @mjhendrickson
  • matthewjhendrickson
  • mjhendrickson


Web Scraping in R & rvest on GitHub

This talk is freely distributed under the MIT License.